9 research outputs found

    Automatic Synthesis of Regular Expressions from Examples

    We propose a system for the automatic generation of regular expressions for text-extraction tasks. The user describes the desired task only by means of a set of labeled examples. The generated regexes may be used with common engines such as those that are part of Java, PHP, Perl and so on. Using the system does not require any familiarity with regular expression syntax. We performed an extensive experimental evaluation on 12 different extraction tasks applied to real-world datasets. We obtained very good results in terms of precision and recall, even in comparison to earlier state-of-the-art proposals. Our results are highly promising toward the achievement of a practical surrogate for the specific skills required for generating regular expressions, and significant as a demonstration of what can be achieved with GP-based approaches on modern IT technology.

    Ingegnerizzazione di Algoritmi di Machine Learning

    2009/2010. Nowadays the available computing and information-storage resources have grown to a level that makes it easy to collect and preserve huge amounts of data. However, several organizations still lack the knowledge or the tools to process these data into useful information. In this thesis we investigate several issues that can be solved effectively by means of machine learning techniques, ranging from web defacement detection to electricity price forecasting, from Support Vector Machines to Genetic Programming. We investigate a framework for web defacement detection meant to allow any organization to join the service by simply providing the URLs of the resources to be monitored along with the contact point of an administrator. Our approach is based on anomaly detection and allows monitoring the integrity of many remote web resources automatically while remaining fully decoupled from them, in particular without requiring any prior knowledge about those resources; it is thus an unsupervised system. Furthermore, we test several machine learning algorithms normally used for anomaly detection on the web defacement detection problem. We present a scrolling system for mobile devices that provides a more natural and effective user experience on small screens. We detect device motion by analyzing the video stream generated by the camera and then transform that motion into scrolling of the content rendered on the screen. This way, the user experiences the device screen as a small movable window on a larger virtual view, without requiring any dedicated motion-detection hardware. As regards information retrieval, we present an approach for information extraction from multi-page printed documents; the approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time.
    Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. A key step in the understanding of printed documents is their classification based on the nature of the information they contain and their layout; we consider both a static scenario, in which document classes are known a priori and no new classes can appear, and a dynamic one, in which classes are not known a priori and new classes can appear at any time. Finally, we move to the edge of machine learning: Genetic Programming. The electric power market increasingly relies on competitive mechanisms taking the form of day-ahead auctions, in which buyers and sellers submit their bids in terms of prices and quantities for each hour of the next day. We propose a novel forecasting method based on Genetic Programming; a key feature of our proposal is the handling of outliers, i.e., regions of the input space rarely seen during learning.
    XXIII Ciclo, 198

    Camera-based Scrolling Interface for Hand-held Devices

    The Reaction Time to Web Site Defacements

    Web site defacement has become a common threat for organizations exposed on the web. Several statistics indicate the number of incidents of this sort, but a crucial piece of information is still lacking: the typical duration of a defacement. Clearly, a defacement lasting one week is much more harmful than one lasting a few minutes. In this paper we present the results of a two-month monitoring activity that we performed over more than 62,000 defacements in order to figure out whether and when a reaction to the defacement is taken. We show that this reaction time tends to be unacceptably long, in the order of several days, and follows a long-tailed distribution. We believe our findings may improve the understanding of this phenomenon and highlight issues deserving attention by the research community.

    Human Colostrum and Breast Milk Contain High Levels of TNF-Related Apoptosis-Inducing Ligand (TRAIL).

    Background: TNF-related apoptosis-inducing ligand (TRAIL) is a pleiotropic cytokine, which plays a key role in the immune system as well as in controlling the balance of apoptosis and proliferation in various organs and tissues. Objective: To investigate the presence and levels of soluble TRAIL in human colostrum and milk. Methods: The levels of soluble human TRAIL were measured in human colostrum (day 2 after delivery) and breast milk (day 5 after delivery). The presence of TRAIL was also measured in infant formula. Results: Levels of soluble TRAIL in the colostrum and mature human milk were, respectively, at least 400- and 100-fold higher than those detected in human serum. No TRAIL was detected in formula. Conclusion: Human soluble TRAIL is present at extremely high levels in human colostrum and human milk and might have a significant role in mediating the anti-cancer activity of human milk.